Practical: early outbreak assessment, transmissibility and forecasting

Introduction

On the tidyverse

The tidyverse is a collection of R packages designed for data science. Developed with high standards of coding practices and software development, these packages form a consistent ecosystem to handle and visualise data. In this practical, we will mostly rely on the following packages:

  • dplyr for handling data, making new variables, building summaries, number-crunching

  • tidyr to re-arrange and tidy data

  • ggplot2 to visualise data

  • magrittr for the piping operator (%>%)
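As a quick taste of how these packages combine, here is a minimal sketch using made-up data (the data.frame and its columns are invented for illustration):

```r
library(dplyr)

# hypothetical mini-linelist
cases <- data.frame(
  gender = c("m", "f", "f", "m", "f"),
  age    = c(23, 34, 51, 19, 42)
)

# count cases and compute mean age by gender, chaining steps with %>%
cases %>%
  group_by(gender) %>%
  summarise(n = n(), mean_age = mean(age))
```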

Most of these functionalities are summarised in handy cheatsheets. We provide links to the most relevant ones below:


Other packages

Other packages we will use include:

  • rmarkdown for automated report generation

  • rio to read in .xlsx files

Cheatsheets and other resources:

  • the rmarkdown cheatsheet

  • the knitr website documenting many options used in rmarkdown’s settings and code chunks


On the RECONverse

Several RECON packages will be used in the practical:

  • linelist for data cleaning

  • incidence an older package to build and visualise epidemic curves and do basic model fitting; it is being replaced by incidence2, but some older packages such as earlyR (for estimating the reproduction number) and projections (for short-term forecasting) still rely on it

  • incidence2 a re-implementation of incidence, used to build and handle epidemic curves

  • i2extras provides simplified wrappers for estimating growth rates using different GLMs from incidence2 objects

  • epicontacts to visualise and analyse contact tracing data or transmission chains

  • epitrix to estimate delay distributions

  • distcrete to handle delay distributions

  • earlyR for early-stage estimation of the reproduction number

  • projections for short-term forecasting

Note that earlyR and projections will soon be adapted to handle incidence2 objects.

For some of these packages, we recommend using the github (development) version, either because it is more up-to-date, or because they have not been released on CRAN yet.


What we will do today

This practical will use different outbreak data to illustrate how the various packages mentioned above can be combined to explore outbreak data and carry out analyses specific to the outbreak response context. Building on material seen in previous sessions, we will particularly look at:

  • cleaning / standardising ‘dirty’ data
  • building epidemic curves
  • deriving empirical distributions of key delays
  • fitting delays to discretised Gamma distributions using maximum likelihood, and summarising the results
  • visualising and analysing transmission chains / contact tracing data
  • estimating the growth rate and doubling time from epidemic curves
  • estimating the reproduction number from epidemic curves
  • performing short-term forecasting of case incidence

In addition, we strongly recommend storing all analyses into one or several rmarkdown reports.


How this document works

This practical will take you through initial exploratory analyses of different datasets. Some parts will be used merely to illustrate a technical point (e.g. how to count cases by different groups), some will focus more on the epidemiological meaning of a result, and some will do both. These analyses are merely meant as a guide, and are by no means an exhaustive exploration of the data. Be curious: feel free to ask new questions and try to answer them. In all instances, try to answer your own questions first: trial and error is a great way to learn R. But if you feel stuck, do not hesitate to ask questions to the demonstrators: they are here to help.

In this practical, most command lines are provided, but not all. In the few instances where code is not provided, you will typically need to re-use and adapt previous code to produce the result requested. In most cases, copy-paste should be sufficient to reproduce the results.

However, it is important that you understand what the code does. Look at relevant documentation, try tweaking the code, experiment, and do not hesitate to ask questions. It is not essential to go through every example, so take your time, especially if you are new to R. Also note that the solutions provided are by no means the only options for solving a given problem - feel free to explore alternatives and discuss them with the trainers.




A novel EVD outbreak in Ankh, Republic of Morporkia

This practical simulates the early assessment of an Ebola Virus Disease (EVD) outbreak. It introduces various aspects of data analysis of the early stage of an outbreak, including contact tracing data, characterisation of the serial interval, and estimation of the growth rate, doubling time and reproduction number from epidemic curves.

A new EVD outbreak has been notified in the small city of Ankh, located in the Northern, rural district of the Republic of Morporkia. Public Health Morporkia (PHM) is in charge of coordinating the outbreak response, and has contracted you as a consultant in outbreak analytics to inform the response in real time.


Importing data

While a new data update is pending, you have been given the linelist and contact data describing the early stages of the outbreak as xlsx files:

  • phm_evd_linelist_2017-10-27.xlsx: a linelist containing case information up to the 27th October 2017

  • phm_evd_contacts_2017-10-27.xlsx: a list of contacts reported between cases up to the 27th October 2017, where from indicates a potential source of infection, and to the recipient of the contact.

To import these files:

  1. create a data/ folder in your project directory

  2. download and save these files in data/

  3. import the files using here::here to find the path to the data, and rio::import to read them into R
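The three steps above could look like this (a sketch, assuming the files have been saved in data/ under the names given earlier):

```r
library(here)  # builds paths relative to the project root
library(rio)   # reads many formats, including .xlsx

linelist <- import(here("data", "phm_evd_linelist_2017-10-27.xlsx"))
contacts <- import(here("data", "phm_evd_contacts_2017-10-27.xlsx"))
```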


Data cleaning

Once imported, the raw data should look like:

Examine these data and try to list existing issues.

What is wrong with the raw data?

The raw data, whilst very small in size, have a number of issues, including:

  • capitalisation: mixture of upper case and lower cases

  • dates: their format is very heterogeneous, so that dates are not readily useable

  • special characters: accents, various separators and trailing / heading white spaces

  • typos / inconsistent coding: gender is sometimes indicated in full words, sometimes abbreviated; location has some obvious abbreviations and typos

We will use the function clean_data() from the linelist package to clean these data. This function will:

  • set all characters to lower case
  • use _ as universal separator
  • remove all heading and trailing separators
  • replace all accented characters with their closest ASCII match
  • detect date formats and convert entries to Date when relevant (see argument guess_dates)
  • replace typos and recode variables (see argument wordlists)

For this last part, we need to load a separate file containing cleaning rules: phm_evd_cleaning_rules.xlsx.

Examine this file (and read the ‘explanations’ tab), import it as the other files, and save the resulting data.frame as an object called cleaning_rules. The output should look like:

We can now clean our data using clean_data; execute and interpret the following commands:
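A sketch of what these commands may look like (the object names are assumptions based on the import step above):

```r
library(linelist)

# clean both datasets; typos and recoding are handled via the cleaning rules
linelist_clean <- clean_data(linelist, wordlists = cleaning_rules)
contacts_clean <- clean_data(contacts)

head(linelist_clean)
```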


Looking at transmission chains


Exercise: contact tracing is at the centre of an Ebola outbreak response. Using the same approach seen in the previous practical, build an epicontacts object using the function make_epicontacts, and store the results as x.


You can easily plot these contacts, but with a little bit of tweaking (see ?vis_epicontacts) you can customise shapes by gender:
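A sketch of such a customised plot, assuming the epicontacts object x from the previous exercise, and a gender column coded as "male" / "female" after cleaning:

```r
library(epicontacts)

# plot() on an epicontacts object uses vis_epicontacts();
# node_shape maps a linelist column to node icons
plot(x,
     node_shape = "gender",
     shapes = c(male = "male", female = "female"))
```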


Question: What can you say about these contacts? Is there evidence of super-spreading? Can you estimate it at this stage?


Explanation:

The transmission chains for the first few cases show one super-spreading event; there are not sufficient data at this stage to fully characterise the offspring distribution, but we will need to keep an eye out for further super-spreading events, and possibly investigate these further to design targeted interventions.


Looking at incidence curves

The first question PHM asks you is simply: how bad is it? Given that this is a terrible disease, with a mortality rate nearing 70%, there is a lot of concern about this outbreak getting out of control. The first step of the analysis lies in drawing an epicurve, i.e. a plot of incidence over time.


In principle, the newer package incidence2 should be chosen to build epidemic curves. However, here we shall use its older sibling incidence, which provides some facilities for estimating growth rates that we will use later.

We compute daily incidence based on the dates of symptom onset, storing results in an object called i:
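For instance (the name of the date-of-onset column is an assumption about the cleaned linelist):

```r
library(incidence)

# daily incidence from dates of symptom onset
i <- incidence(linelist_clean$date_of_onset)
i
plot(i)
```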


Exercise: If you pay close attention to the dates on the x-axis, you may notice that something is missing. Indeed, the graph stops right after the last case, but field epidemiologists from PHM insist that the data should be complete until the 27th October 2017. Try to fix this using the argument last_date in the incidence function.



Estimating the growth rate (r)

The simplest model of incidence is probably the log-linear model, i.e. a linear regression on log-transformed incidences. In the incidence package, the function fit will estimate the parameters of this model from an incidence object (here, i), and derive estimates of doubling / halving times and associated confidence intervals:
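A sketch of the fitting step, assuming the incidence object i built above:

```r
library(incidence)

f <- fit(i)
f                   # growth rate r, doubling time, and 95% CIs
plot(i, fit = f)    # epicurve with the fitted log-linear model overlaid
```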


Question: How would you interpret this result? What criticisms would you make of this model?


Explanation:

The log-linear regression model is not appropriate when there are days of zero incidence. In this case, these data are ignored, which can lead to biases in the estimation of the growth rate. In addition, because the model uses fewer data points (days with no incidence are ignored), the confidence intervals associated with r will be wider. Other models, such as a Poisson GLM, or branching processes, should be considered.


Estimating the reproduction number (R)

Branching process models can be used to estimate the reproduction number (R), given incidence data and a serial interval distribution. As you have already built an incidence object, the only missing information relates to the serial interval distribution. As current data are insufficient to estimate it, you can use previously published estimates with a mean of 15.3 days and a standard deviation of 9.3 days (WHO Ebola Response Team 2014).


Exercise: Look at the documentation of the function get_R from the earlyR package, and identify the arguments corresponding to the incidence data, and the mean and standard deviation of the serial interval. Run the function with the correct arguments, and store the output as a new object called R; results should resemble:


You can visualise the results as follows:
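For instance, using the plotting methods provided by earlyR (assuming the object R created in the previous exercise):

```r
library(earlyR)

plot(R)             # likelihood distribution of R, with the ML estimate
plot(R, "lambdas")  # global force of infection over time
```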

The first figure shows the distribution of likely values of R, and the Maximum-Likelihood (ML) estimation. The second figure shows the global force of infection over time, with dashed bars indicating dates of onset of the cases.


Question: Interpret these results: what do you make of the reproduction number? What does it reflect? Based on the last part of the epicurve, some colleagues suggest that incidence is going down and the outbreak may be under control. What is your opinion on this?


Explanation:

The estimation of the reproduction number clearly suggests that it is significantly greater than 1. There may not have been any cases over the last few days, but the results suggest that an increase in cases is actually very likely.


Short-term forecasting

The function project from the package projections can be used to simulate plausible epidemic trajectories, by simulating daily incidence using the same branching process as the one used to estimate \(R_0\) in earlyR. All that is needed is one or several values of \(R_0\) and a serial interval distribution, stored as a distcrete object. As the latter is included in the output of earlyR, we have all the necessary inputs to carry out these simulations.

Here, we illustrate how we can simulate 5 random trajectories using a fixed value of \(R_0\) = 3.02, the ML estimate of \(R_0\):
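A sketch of these simulations, assuming the serial interval distribution is stored as R$si in the earlyR output:

```r
library(projections)
library(magrittr)

proj <- project(i,
                R = 3.02,      # ML estimate of R0
                si = R$si,     # serial interval, as a distcrete object
                n_sim = 5,     # number of simulated trajectories
                n_days = 14)   # forecasting horizon

# overlay the simulated trajectories on the epicurve
plot(i) %>% add_projections(proj)
```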


Exercise: Read the commands below and interpret the results.



Question: According to these results, what are the chances that more cases will appear in the near future? Is this outbreak being brought under control? Would you recommend scaling up / down the response?


Explanation:

Short-term forecasts suggest that exponential growth is likely. The apply function, used to summarise the 5000 simulations on each day, provides the numbers accompanying the plot. The second use of the function calculates, for each day, the proportion of simulations showing at least one case (mean(x > 0)), which varies from 84% to 97%. Given these results, scaling up operations seems like a good idea.



COVID-19 in the UK

For the second part of this practical, we look at real data reporting COVID-19 hospitalisations in England, broken down by hospital and National Health Service (NHS) region. The data are available online from NHS England’s website. The dataset analysed here is a simplified version, providing incidence of hospital admissions by NHS trust, stored in the file covid_admissions_uk_2020_10_24.xlsx.


Distribution of cases by NHS region

Here, we try to get an overall picture of the distribution of hospital admissions across England. The following code sums the number of cases reported by region:
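For example (a sketch; the object and column names are assumptions about the imported dataset):

```r
library(dplyr)

covid %>%
  group_by(nhs_region) %>%
  summarise(total = sum(cases, na.rm = TRUE)) %>%
  arrange(desc(total))
```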


Exercise: Adapt this code to derive the total number of cases by region and NHS trust, provide a visual representation of the resulting numbers, and interpret the results; your plot may resemble:



Epidemic curves

Unlike linelists of cases, this dataset contains pre-calculated counts by date, NHS region and trust. While incidence2 is primarily designed to handle linelist-like data, it can also convert pre-computed incidences using the count argument:
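A sketch of this conversion (the column names are assumptions about the dataset):

```r
library(incidence2)

i_covid <- incidence(covid,
                     date_index = "date",       # dates of admission
                     groups     = "nhs_region", # one epicurve per region
                     count      = "cases")      # pre-computed counts
plot(i_covid, fill = "nhs_region")
```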


Exercise: Use facet_plot, tweaking the arguments angle, format and the one specifying the number of dates on the x-axis, to reproduce the plot below; which do you think is best at representing dynamics of COVID-19 in England?


Explanation:

Both plots are interesting, with different emphases: the coloured plot is better for assessing global tendencies, while the facetted plot is better for appreciating the individual dynamics of the regions.


Growth rates

The new package i2extras adds a number of tools for incidence2 objects, including the implementation of different GLMs to fit epicurves, and derive growth rates.

To illustrate these approaches, we first select the data so that:

  • we use daily incidences by region
  • we keep the last 4 weeks of data
  • we exclude the last week of data, as it may be prone to biases due to reporting delays
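These selection steps can be sketched as follows (the object and column names are assumptions about the dataset):

```r
library(dplyr)

last_date <- max(covid$date)

covid_recent <- covid %>%
  filter(date > last_date - 28,   # keep the last 4 weeks ...
         date <= last_date - 7)   # ... but drop the most recent week
```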


Exercise: We can now fit models to these data. To do so, try doing the following:

  1. load the package i2extras; if not installed, follow instructions from the package’s website

  2. use the function fit_curve() to fit a negative binomial model to the incidence data by region, save the output as a new object last_trends, and plot it

  3. feed this output to the function growth_rate() to obtain estimates of growth rates, doubling times and their confidence intervals, for each region; the results should resemble:


The code below can be used to visualise the results:
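For instance, a sketch using ggplot2 (the column names of the growth_rate() output are assumptions):

```r
library(ggplot2)

# `rates` is assumed to hold the output of growth_rate()
ggplot(rates, aes(x = nhs_region, y = r)) +
  geom_point() +
  geom_errorbar(aes(ymin = r_lower, ymax = r_upper), width = 0.2) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  coord_flip() +
  labs(x = NULL, y = "daily growth rate (r)")
```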


Exercise: repeat the same figure with doubling times. Taken together, what do these results suggest about the evolution of COVID-19 in the UK over the next weeks? What would be the doubling time for growth rates (r) close to 0? When would you recommend using doubling times for communicating results on the evolution of an epidemic?


Explanation:

These results show that all regions display significant growth, and we expect cases to keep growing exponentially in the absence of changes in transmission. Doubling times are fine as all confidence intervals of r exclude zero. Otherwise, confidence intervals for doubling times would effectively be exclusion intervals, which are a lot harder to communicate. Doubling times should only be reported when the outbreak is significantly growing.



References

WHO Ebola Response Team. 2014. “Ebola Virus Disease in West Africa–the First 9 Months of the Epidemic and Forward Projections.” N. Engl. J. Med. 371 (16): 1481–95.